Abstract

The Bayesian framework is ideally suited for induction problems.
The probability of observing $x_t$ at time $t$, given past
observations $x_1...x_{t-1}$, can be computed with Bayes' rule if
the true distribution $\mu$ of the sequences $x_1x_2x_3...$ is
known. The problem, however, is that in many cases one does not
even have a reasonable estimate of the true distribution. In order
to overcome this problem a universal distribution $\xi$ is defined
as a weighted sum of distributions $\mu_i\in M$, where $M$ is any
countable set of distributions including $\mu$. This is a
generalization of Solomonoff induction, in which $M$ is the set of
all enumerable semi-measures. Systems which predict $y_t$, given
$x_1...x_{t-1}$, and which receive loss $l_{x_ty_t}$ if $x_t$ is
the true next symbol of the sequence are considered. It is proven
that using the universal $\xi$ as a prior is nearly as good as
using the unknown true distribution $\mu$. Furthermore, games of
chance, defined as a sequence of bets, observations, and rewards,
are studied. The time needed to reach the winning zone is bounded
in terms of the relative entropy of $\mu$ and $\xi$. Extensions to
arbitrary alphabets, partial and delayed prediction, and more
active systems are discussed.



1 Introduction

1.1 Induction

Many problems are of induction type, in which statements about the
future have to be made, based on past observations. What is the
probability of rain tomorrow, given the weather observations of the
last few days? Is the Dow Jones likely to rise tomorrow, given the
chart of the last years and possibly additional newspaper
information? Can we reasonably doubt that the sun will rise
tomorrow? Indeed, one definition of science is to predict the
future, where, as an intermediate step, one tries to understand the
past by developing theories and, as a consequence of prediction,
one tries to manipulate the future. All induction problems may be
studied in the Bayesian framework. The probability of observing
$x_t$ at time $t$, given the observations $x_1...x_{t-1}$, can be
computed with Bayes' rule if we know the true probability
distribution of observation sequences $x_1x_2x_3...$. The problem
is that in many cases we do not even have a reasonable guess of the
true distribution $\mu$. What is the true probability of weather
sequences, stock charts, or sunrises?

1.2 Universal Sequence Prediction

Solomonoff [Sol64] had the idea to define a universal probability
distribution^1 $\xi$ as a weighted average over all possible
computable probability distributions. Lower weights were assigned
to more complex distributions. He unified Epicurus' principle of
multiple explanations, Occam's razor, and Bayes' rule into an
elegant formal theory. For a binary alphabet, the universal
conditional probability used for predicting $x_t$ converges to the
true conditional probability for $t\to\infty$ with probability 1.
The convergence serves as a justification of using $\xi$ as a
substitution for the usually unknown $\mu$. The framework can
easily be generalized to other probability classes and weights
[Sol78].

1 We use the term distribution slightly imprecisely for a
probability measure.



1.3 Contents 

The main aim of this work is to prove expected loss bounds for
general loss functions which measure the performance of $\xi$
relative to $\mu$, and to apply the results to games of chance.
Details and proofs can be found in [Hut01]. There are good
introductions and surveys of Solomonoff sequence prediction [LV97],
inductive inference [AS83, Sol97], reasoning under uncertainty
[Grü98] and competitive online statistics [Vov99] with interesting
relations to this work. See [Hut01] and subsection 5.4 for details.

Section 2 explains notation and defines the generalized universal
distribution $\xi$ as the weighted sum of probability distributions
$\mu_i$ of a set $M$, which must include the true distribution
$\mu$. This generalization is straightforward and causes no
problems. $\xi$ multiplicatively dominates all $\mu_i\in M$, and
the relative entropy between $\mu$ and $\xi$ is bounded by
$\ln\frac{1}{w_\mu}$. Convergence of $\xi$ to $\mu$ is shown in
Theorem 1.

Section 3 considers the case where a prediction or action
$y_t\in\mathcal{Y}$ results in a loss $l_{x_ty_t}$ if $x_t$ is the
next symbol of the sequence. Optimal universal $\Lambda_\xi$ and
optimal informed $\Lambda_\mu$ prediction schemes are defined for
this case and loss bounds are proved. Theorems 2 and 3 bound the
total loss $L_\xi$ of $\Lambda_\xi$ by the total loss $L_\mu$ of
$\Lambda_\mu$ plus $O(\sqrt{L_\mu})$ terms.

Section 4 applies Theorem 3 to games of chance, defined as a
sequence of bets, observations, and rewards. The average profit
$\bar p_{n\Lambda_\xi}$ achieved by the $\Lambda_\xi$ scheme
rapidly converges to the best possible average profit
$\bar p_{n\Lambda_\mu}$ achieved by the $\Lambda_\mu$ scheme
($\bar p_{n\Lambda_\xi} - \bar p_{n\Lambda_\mu} = O(n^{-1/2})$). If
there is a profitable scheme at all, asymptotically the universal
$\Lambda_\xi$ scheme will also become profitable. Theorem 4 lower
bounds the time needed to reach the winning zone in terms of the
relative entropy of $\mu$ and $\xi$. An attempt is made to give an
information theoretic interpretation of the result.

Section 5 outlines possible extensions of the presented theory and
results. They include arbitrary alphabets, partial, delayed and
probabilistic prediction, classification, even more general loss
functions, active systems influencing the environment, learning
aspects, and a comparison to the weighted majority algorithm(s)
and loss bounds.



2 Setup and Convergence 



2.1 Strings and Probability Distributions 

We denote binary strings by $x_1x_2...x_n$ with $x_t\in\{0,1\}$.
We further use the abbreviations
$x_{n:m} := x_nx_{n+1}...x_{m-1}x_m$ and
$x_{<n} := x_1...x_{n-1}$. We use Greek letters for probability
distributions. Let $\rho(x_{1:t})$ be the probability that an
(infinite) sequence starts with $x_1...x_t$. The conditional
probability

    $$\rho(x_t|x_{<t}) = \frac{\rho(x_{1:t})}{\rho(x_{<t})}$$    (1)

that a given string $x_1...x_{t-1}$ is continued by $x_t$ is
obtained by using Bayes' rule. The prediction schemes will be based
on these posteriors.

2.2 Universal Prior Probability Distribution

Every inductive inference problem can be brought into the following
form: Given a string $x_{<t}$, take a guess at its continuation
$x_t$. We will assume that the strings which have to be continued
are drawn from a probability^2 distribution $\mu$. The maximal
prior information a prediction algorithm can possess is the exact
knowledge of $\mu$, but in many cases the true distribution is not
known. Instead, the prediction is based on a guess $\rho$ of $\mu$.
We expect that a predictor based on $\rho$ performs well if $\rho$
is close to $\mu$ or converges, in a sense, to $\mu$. Let
$M := \{\mu_1, \mu_2, ...\}$ be a finite or countable set of
candidate probability distributions on strings. We define a
weighted average on $M$

    $$\xi(x_{1:n}) := \sum_{\mu_i\in M} w_{\mu_i}\cdot\mu_i(x_{1:n}), \qquad \sum_{\mu_i\in M} w_{\mu_i} = 1, \qquad w_{\mu_i} > 0.$$    (2)

It is easy to see that $\xi$ is a probability distribution as the
weights $w_{\mu_i}$ are positive and normalized to 1 and the
$\mu_i\in M$ are probabilities. For finite $M$ a possible choice
for the $w$ is to give all $\mu_i$ equal weight
($w_{\mu_i} = \frac{1}{|M|}$). We call $\xi$ universal relative to
$M$, as it multiplicatively dominates all distributions in $M$

    $$\xi(x_{1:n}) \ge w_{\mu_i}\cdot\mu_i(x_{1:n}) \quad\text{for all } \mu_i\in M.$$    (3)

In the following, we assume that $M$ is known and contains the true
distribution, i.e. $\mu\in M$. This is not a serious constraint if
we include all computable probability distributions in $M$ with a
high weight assigned to simple $\mu_i$.

2 This includes deterministic environments, in which case the
probability distribution $\mu$ is 1 for some sequence
$x_{1:\infty}$ and 0 for all others. We call probability
distributions of this kind deterministic.

Solomonoff's universal semi-measure is obtained if we include all
enumerable semi-measures in $M$ with weights
$w_{\mu_i} \sim 2^{-K(\mu_i)}$, where $K(\mu_i)$ is the length of
the shortest program for $\mu_i$ [Sol64, Sol78, LV97]. A detailed
discussion of various general purpose choices for $M$ is given in
[Hut01].
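As a small illustration of the weighted average (2) and the dominance property (3), the following sketch builds $\xi$ for a toy class. The class of five i.i.d. Bernoulli sources, the uniform weights, and all function names are assumptions made for this example only; they are not taken from the paper.

```python
from itertools import product

def bernoulli(theta):
    """Hypothetical toy environment: i.i.d. bits with P(bit = 1) = theta."""
    def mu(x):
        ones = sum(x)
        return theta ** ones * (1.0 - theta) ** (len(x) - ones)
    return mu

# Assumed finite class M with uniform weights w = 1/|M|.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
M = [bernoulli(th) for th in thetas]
w = [1.0 / len(M)] * len(M)

def xi(x):
    """Weighted average (2): xi(x) = sum_i w_i * mu_i(x)."""
    return sum(w_i * mu_i(x) for w_i, mu_i in zip(w, M))

n = 8
strings = list(product((0, 1), repeat=n))

# xi is normalized over all strings of length n ...
total_mass = sum(xi(x) for x in strings)
# ... and multiplicatively dominates every member of M, as in (3).
dominates = all(xi(x) >= w_i * mu_i(x)
                for x in strings for w_i, mu_i in zip(w, M))
print(total_mass, dominates)
```

The exhaustive enumeration over all $2^n$ strings is only feasible for small $n$; it is used here to make the two properties checkable rather than as a practical implementation.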

Furthermore, we need the relative entropy between $\mu$ and $\xi$:

    $$h_t(x_{<t}) := \sum_{x_t\in\{0,1\}} \mu(x_t|x_{<t}) \ln\frac{\mu(x_t|x_{<t})}{\xi(x_t|x_{<t})}$$    (4)

$H_n$ is then defined as the sum-expectation, for which the
following can be shown

    $$H_n := \sum_{t=1}^{n} \sum_{x_{<t}\in\{0,1\}^{t-1}} \mu(x_{<t})\cdot h_t(x_{<t}) \le \ln\frac{1}{w_\mu} =: d_\mu$$    (5)

The following theorem shows the important property of $\xi$
converging to the true distribution $\mu$, in a sense.

Theorem 1 (Convergence) Let there be binary sequences $x_1x_2...$
drawn with probability $\mu(x_{1:n})$ for the first $n$ symbols.
The universal conditional probability $\xi(x_t|x_{<t})$ of the next
symbol $x_t$ given $x_{<t}$ is related to the true conditional
probability $\mu(x_t|x_{<t})$ in the following way:

    i)  $\sum_{t=1}^{n} \sum_{x_{1:t}} \mu(x_{<t}) \big(\mu(x_t|x_{<t}) - \xi(x_t|x_{<t})\big)^2 \le H_n \le d_\mu = \ln\frac{1}{w_\mu} < \infty$

    ii) $\xi(x_t|x_{<t}) \to \mu(x_t|x_{<t})$ for $t\to\infty$ with $\mu$ probability 1

where $H_n$ is the relative entropy (5), and $w_\mu$ is the weight
(2) of $\mu$ in $\xi$.


(i) and (ii) are easy generalizations of [Sol78] to arbitrary
weights and an arbitrary probability set $M$. For $n\to\infty$ the
l.h.s. of (i) is an infinite $t$-sum over positive arguments, which
is bounded by the finite constant $d_\mu$ on the r.h.s. Hence the
arguments must converge to zero for $t\to\infty$. Since the
arguments are $\mu$ expectations of the squared difference of $\xi$
and $\mu$, this means that $\xi(x_t|x_{<t})$ converges^3 to
$\mu(x_t|x_{<t})$ with $\mu$ probability 1. This proves (ii). Since
the conditional probabilities are the basis of all prediction
algorithms considered in this work, we expect a good prediction
performance if we use $\xi$ as a guess of $\mu$. Performance
measures are defined in the next section.
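The entropy bound (5) can be checked numerically on a toy class. The sketch below assumes three i.i.d. Bernoulli environments with uniform weights (an illustrative choice, not from the paper) and uses the chain rule of relative entropy, under which the sum-expectation $H_n$ equals the relative entropy between the length-$n$ marginals of $\mu$ and $\xi$.

```python
import math
from itertools import product

# Assumed toy class: three i.i.d. Bernoulli sources, uniform weights.
thetas = [0.2, 0.5, 0.8]
w = [1.0 / 3] * 3
true_index = 2  # the true environment mu is Bernoulli(0.8)

def seq_prob(theta, x):
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)

def xi(x):
    return sum(w_i * seq_prob(th, x) for w_i, th in zip(w, thetas))

def relative_entropy(n):
    """H_n via the chain rule: sum over x_{1:n} of mu(x) * ln(mu(x) / xi(x))."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        m = seq_prob(thetas[true_index], x)
        total += m * math.log(m / xi(x))
    return total

d_mu = math.log(1.0 / w[true_index])  # ln(1 / w_mu) = ln 3
H = {n: relative_entropy(n) for n in (1, 4, 8, 12)}
for n, value in H.items():
    print(n, value, "<=", d_mu)
```

In this setup $H_n$ grows with $n$ but stays below $d_\mu = \ln 3$, which is the behaviour that (5) and Theorem 1(i) describe.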



3 More precisely $\xi(x_t|x_{<t}) - \mu(x_t|x_{<t})$ converges to
zero for $t\to\infty$ with $\mu$ probability 1 or, more stringent,
in a mean squared sense.



3 Loss Bounds 

3.1 Unit Loss Function 

A prediction is very often the basis for some decision. The
decision results in an action, which itself leads to some reward or
loss. If the action itself can influence the environment we enter
the domain of acting agents, which has been analyzed in the context
of universal probability in [Hut00]. To stay in the framework of
(passive) prediction we have to assume that the action itself does
not influence the environment. Let $l_{x_ty_t}\in\mathbb{R}$ be the
received loss when taking action $y_t\in\mathcal{Y}$ and
$x_t\in\{0,1\}$ is the $t$-th symbol of the sequence. We demand $l$
to be normalized, i.e. $0\le l_{x_ty_t}\le 1$. For instance, if we
make a sequence of weather forecasts $\{0,1\}$ = {sunny, rainy} and
base our decision, whether to take an umbrella or wear sunglasses,
$\mathcal{Y}$ = {umbrella, sunglasses}, on it, the action of taking
the umbrella or wearing sunglasses does not influence the future
weather (ignoring the butterfly effect). Reasonable losses may be



    Loss         sunny   rainy
    umbrella      0.3     0.1
    sunglasses    0.0     1.0


In many cases the prediction of $x_t$ can be identified with, or is
already, the action $y_t$. The forecast sunny can be identified
with the action wear sunglasses, and rainy with take umbrella. In
the following, we assume "predictive" actions of this kind, i.e.
$\mathcal{Y} = \{0,1\}$. General action spaces $\mathcal{Y}$ and
general alphabets $\mathcal{A}$ are considered in [Hut01].

The true probability of the next symbol being $x_t$, given
$x_{<t}$, is $\mu(x_t|x_{<t})$. The expected loss when predicting
$y_t$ is $\mu(1|x_{<t})l_{1y_t} + \mu(0|x_{<t})l_{0y_t}$. The goal
is to minimize the expected loss. More generally we define the
$\Lambda_\rho$ prediction scheme

    $$y_t^{\Lambda_\rho} := \arg\min_{y_t} \sum_{x_t\in\{0,1\}} \rho(x_t|x_{<t})\, l_{x_ty_t}$$    (6)

which minimizes the $\rho$-expected loss. This is a threshold
strategy with $y_t^{\Lambda_\rho} = 0/1$ for
$\rho(1|x_{<t}) \lessgtr \gamma$, where
$\gamma := \frac{l_{01}-l_{00}}{l_{01}-l_{00}+l_{10}-l_{11}}$. As
the true distribution is $\mu$, the actual $\mu$-expected loss when
$\Lambda_\rho$ predicts the $t$-th symbol and the total
$\mu$-expected loss in the first $n$ predictions are

    $$l_{t\Lambda_\rho}(x_{<t}) := \sum_{x_t} \mu(x_t|x_{<t})\, l_{x_t y_t^{\Lambda_\rho}}, \qquad L_{n\Lambda_\rho} := \sum_{t=1}^{n} \sum_{x_{<t}} \mu(x_{<t})\cdot l_{t\Lambda_\rho}(x_{<t}).$$    (7)

In the special case $l_{01} = l_{10} = 1$ and $l_{00} = l_{11} = 0$,
the bit with the highest $\rho$ probability is predicted
($\gamma = \frac{1}{2}$), and $L_{n\Lambda_\rho}$ is the total
expected number of prediction errors.

If $\mu$ is known, $\Lambda_\mu$ is obviously the best prediction
scheme in the sense of achieving minimal expected loss

    $$L_{n\Lambda_\mu} \le L_{n\Lambda_\rho} \quad\text{for any } \Lambda_\rho.$$    (8)

The predictor $\Lambda_\xi$, based on the universal distribution
$\xi$, is of special interest.
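The threshold form of the $\Lambda_\rho$ scheme can be made concrete with the weather losses from the table above. In this sketch the encoding (0 = sunny, 1 = rainy for outcomes; 0 = sunglasses, 1 = umbrella for actions) and the function names are assumptions for illustration; ties are broken in favour of action 0, which is one of several valid choices.

```python
def lambda_action(rho1, loss):
    """Return argmin_y of the rho-expected loss, where rho1 = rho(1 | history).

    loss[x][y] is the loss for outcome x and action y.
    """
    expected = [(1.0 - rho1) * loss[0][y] + rho1 * loss[1][y] for y in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1

def threshold(loss):
    """gamma = (l01 - l00) / (l01 - l00 + l10 - l11)."""
    return (loss[0][1] - loss[0][0]) / (
        loss[0][1] - loss[0][0] + loss[1][0] - loss[1][1])

# Weather example: rows are outcomes (sunny, rainy),
# columns are actions (sunglasses, umbrella).
weather_loss = [[0.0, 0.3],
                [1.0, 0.1]]

gamma = threshold(weather_loss)
print(gamma)  # rain probability above which the umbrella is taken

# The argmin rule and the threshold rule agree away from the tie point.
agree = all(lambda_action(k / 100, weather_loss) == (1 if k / 100 > gamma else 0)
            for k in range(101) if k != 25)
print(agree)
```

With these losses the umbrella is worth taking as soon as the predicted probability of rain exceeds one quarter, reflecting how much more costly it is to be caught in the rain without one.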

Theorem 2 (Unit loss bound) Let there be binary sequences
$x_1x_2...$ drawn with probability $\mu(x_{1:n})$ for the first $n$
symbols. A system predicting $y_t\in\{0,1\}$ given $x_{<t}$
receives loss $l_{x_ty_t}\in[0,1]$ if $x_t$ is the true $t$-th
symbol of the sequence. The $\Lambda_\rho$-system (6) predicts as
to minimize the $\rho$-expected loss. $\Lambda_\xi$ is the
universal prediction scheme based on the universal prior $\xi$.
$\Lambda_\mu$ is the optimal informed prediction scheme. The total
$\mu$-expected losses $L_{n\Lambda_\xi}$ of $\Lambda_\xi$ and
$L_{n\Lambda_\mu}$ of $\Lambda_\mu$ as defined in (7) are bounded
in the following way

    $$0 \le L_{n\Lambda_\xi} - L_{n\Lambda_\mu} \le H_n + \sqrt{4L_{n\Lambda_\mu}H_n + H_n^2}$$

where $H_n \le \ln\frac{1}{w_\mu}$ is the relative entropy (5), and
$w_\mu$ is the weight (2) of $\mu$ in $\xi$.

First, we observe that the total loss $L_{\infty\Lambda_\xi}$ of
the universal $\Lambda_\xi$ predictor is finite if the total loss
$L_{\infty\Lambda_\mu}$ of the informed $\Lambda_\mu$ predictor is
finite. This is especially the case for deterministic $\mu$ and
$l_{00} = l_{11} = 0$, as $L_{n\Lambda_\mu} = 0$ in this case^4,
i.e. $\Lambda_\xi$ receives a finite loss on deterministic
environments if a correct prediction results in zero loss. More
precisely, $L_{\infty\Lambda_\xi} \le 2H_\infty \le 2\ln\frac{1}{w_\mu}$.
A combinatoric argument shows that there are $M$ and $\mu\in M$
with $L_{\infty\Lambda_\xi} \ge \log_2|M|$. This shows that the
upper bound $L_{\infty\Lambda_\xi} \le 2\ln|M|$ for uniform $w$ is
rather tight. For more complicated probabilistic environments,
where even the ideal informed system makes an infinite number of
errors, the theorem ensures that the loss excess
$L_{n\Lambda_\xi} - L_{n\Lambda_\mu}$ is only of order
$\sqrt{L_{n\Lambda_\mu}}$. The excess is quantified in terms of the
information content $H_n$ of $\mu$ (relative to $\xi$), or the
weight $w_\mu$ of $\mu$ in $\xi$. This ensures that the loss
densities $L_n/n$ of both systems converge to each other for
$n\to\infty$. Actually, the theorem ensures more, namely that the
quotient converges to 1, and also gives the speed of convergence
$L_{n\Lambda_\xi}/L_{n\Lambda_\mu} = 1 + O(L_{n\Lambda_\mu}^{-1/2}) \to 1$
for $L_{n\Lambda_\mu}\to\infty$.
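The bound of Theorem 2 can be evaluated exactly for a small toy problem by enumerating all histories. The sketch below assumes a class of three i.i.d. Bernoulli environments with uniform weights and the 0/1 error loss (so $\gamma = 1/2$); these choices and all names are illustrative assumptions rather than anything prescribed by the paper.

```python
import math
from itertools import product

thetas = [0.3, 0.5, 0.8]   # assumed toy class M
w = [1.0 / 3] * 3          # uniform weights
theta_true = 0.8           # true environment mu (a member of M)

def seq_prob(theta, x):
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)

def xi(x):
    return sum(w_i * seq_prob(th, x) for w_i, th in zip(w, thetas))

def action(p_one):
    """0/1 loss: predict the more probable bit (threshold gamma = 1/2)."""
    return 1 if p_one > 0.5 else 0

def mu_expected_error(y):
    """mu-probability that the prediction y is wrong."""
    return theta_true if y == 0 else 1.0 - theta_true

def totals(n):
    """Exact L_{n,Lambda_xi}, L_{n,Lambda_mu} and H_n by enumerating histories."""
    L_xi = L_mu = H = 0.0
    for t in range(1, n + 1):
        for hist in product((0, 1), repeat=t - 1):
            m_hist = seq_prob(theta_true, hist)
            xi_one = xi(hist + (1,)) / xi(hist)   # xi(1 | history)
            L_xi += m_hist * mu_expected_error(action(xi_one))
            L_mu += m_hist * mu_expected_error(action(theta_true))
            h_t = (theta_true * math.log(theta_true / xi_one)
                   + (1 - theta_true) * math.log((1 - theta_true) / (1 - xi_one)))
            H += m_hist * h_t
    return L_xi, L_mu, H

L_xi, L_mu, H_n = totals(10)
excess = L_xi - L_mu
bound = H_n + math.sqrt(4 * L_mu * H_n + H_n ** 2)
print(excess, bound)
```

In this example the excess loss of the universal predictor stays within the bound; how tight the bound is depends on the class and the loss, which this single instance does not explore.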

4 Remember that we named a probability distribution deterministic
if it is 1 for exactly one sequence and 0 for all others.

3.2 Proof Sketch of Theorem 2

The first inequality in Theorem 2 has already been proved (8). For
the second inequality, let us start more modestly and try to find
constants $A > 0$ and $B > 0$ that satisfy the linear inequality

    $$L_{n\Lambda_\xi} \le (A+1)L_{n\Lambda_\mu} + (B+1)H_n.$$    (9)

If we could show

    $$l_{t\Lambda_\xi}(x_{<t}) \le A'\, l_{t\Lambda_\mu}(x_{<t}) + B'\, h_t(x_{<t})$$    (10)

with $A' := A+1$ and $B' := B+1$ for all $t\le n$ and all $x_{<t}$,
(9) would follow immediately by summation and the definition of
$L_n$ and $H_n$. With the abbreviations

    $i = x_t$,  $y_i = \mu(x_t|x_{<t})$,  $z_i = \xi(x_t|x_{<t})$,
    $m = y_t^{\Lambda_\mu}$,  $s = y_t^{\Lambda_\xi}$

the loss and entropy can be expressed by
$l_{t\Lambda_\xi} = \sum_i y_i l_{is}$,
$l_{t\Lambda_\mu} = \sum_i y_i l_{im}$ and
$h_t = \sum_i y_i \ln\frac{y_i}{z_i}$. Inserting this into (10) and
rearranging terms we have to prove

    $$B' \sum_{i=0}^{1} y_i \ln\frac{y_i}{z_i} + \sum_{i=0}^{1} y_i (A' l_{im} - l_{is}) \ge 0.$$    (11)

By definition (6) of $y_t^{\Lambda_\mu}$ and $y_t^{\Lambda_\xi}$ we
have

    $$\sum_i y_i l_{im} \le \sum_i y_i l_{ij} \quad\text{and}\quad \sum_i z_i l_{is} \le \sum_i z_i l_{ij}$$    (12)

for all $j$. Actually, we need the first constraint only for $j=s$
and the second for $j=m$. The cases $l_{im} > l_{is}\ \forall i$
and $l_{is} > l_{im}\ \forall i$ contradict the first/second
inequality (12). Hence we can assume $l_{0m} \ge l_{0s}$ and
$l_{1m} \le l_{1s}$. The symmetric case $l_{0m} \le l_{0s}$ and
$l_{1m} \ge l_{1s}$ is proved analogously or can be reduced to the
first case by renumbering the indices ($0 \leftrightarrow 1$).
Using the abbreviations $a := l_{0m} - l_{0s}$,
$b := l_{1s} - l_{1m}$, $c := y_1 l_{1m} + y_0 l_{0s}$,
$y = y_1 = 1 - y_0$ and $z = z_1 = 1 - z_0$ we can write (11) as

    $$f(y,z) := B'\Big[y\ln\frac{y}{z} + (1-y)\ln\frac{1-y}{1-z}\Big] + A'(1-y)a - yb + Ac \ge 0$$    (13)

for $zb \le (1-z)a$ and $0 \le a, b, c, y, z \le 1$. The constraint
(12) on $y$ has been dropped since (13) will turn out to be true
for all $y$. Furthermore, we can assume that
$d := A'(1-y)a - yb \le 0$ since for $d > 0$, $f$ is trivially
positive ($h_t \ge 0$). Multiplying $d$ with a constant $\ge 1$
will decrease $f$. Let us first consider the case
$z \le \frac{1}{2}$. We multiply the $d$ term by $1/b \ge 1$, i.e.
replace it with $A'(1-y)\frac{a}{b} - y$. From the constraint on
$z$ we know that $\frac{a}{b} \ge \frac{z}{1-z}$. We can decrease
$f$ further by replacing $\frac{a}{b}$ by $\frac{z}{1-z}$ and by
dropping $Ac$. Hence, (13) is proved for $z \le \frac{1}{2}$ if we
can prove

    $$B'[...] + A'(1-y)\frac{z}{1-z} - y \ge 0 \quad\text{for } z \le \tfrac{1}{2}.$$    (14)

The case $z \ge \frac{1}{2}$ is treated similarly. We scale $d$
with $1/a \ge 1$, i.e. replace it with $A'(1-y) - y\frac{b}{a}$.
From the constraint on $z$ we know that
$\frac{b}{a} \le \frac{1-z}{z}$. We decrease $f$ further by
replacing $\frac{b}{a}$ by $\frac{1-z}{z}$ and by dropping $Ac$.
Hence (13) is proved for $z \ge \frac{1}{2}$ if we can prove

    $$B'[...] + A'(1-y) - y\frac{1-z}{z} \ge 0 \quad\text{for } z \ge \tfrac{1}{2}.$$    (15)

In [Hut01] we prove that (14) and (15) indeed hold for
$B \ge \frac{1}{4}A + \frac{1}{A}$. The cautious reader may check
the inequalities numerically. So in summary we proved that (9)
holds for $B \ge \frac{1}{4}A + \frac{1}{A}$. Inserting
$B = \frac{1}{4}A + \frac{1}{A}$ into (9) and minimizing the r.h.s.
with respect to $A$ leads to the bound of Theorem 2 (with
$A^2 = H_n/(L_{n\Lambda_\mu} + \frac{1}{4}H_n)$) □.
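The numerical check suggested to the cautious reader can be sketched as a grid search. The code below evaluates the left-hand sides of (14) and (15) with $A' = A+1$ and $B' = \frac{1}{4}A + \frac{1}{A} + 1$, where the bracket is the binary relative entropy from (13). The grid resolution, the sampled values of $A$, and the function names are assumptions of this example; a grid search is evidence, not a proof.

```python
import math

def kl(y, z):
    """Binary relative entropy y ln(y/z) + (1-y) ln((1-y)/(1-z)), with 0 ln 0 := 0."""
    total = 0.0
    if y > 0.0:
        total += y * math.log(y / z)
    if y < 1.0:
        total += (1.0 - y) * math.log((1.0 - y) / (1.0 - z))
    return total

def f14(y, z, A):
    """Left-hand side of (14), intended for z <= 1/2."""
    A1, B1 = A + 1.0, A / 4.0 + 1.0 / A + 1.0
    return B1 * kl(y, z) + A1 * (1.0 - y) * z / (1.0 - z) - y

def f15(y, z, A):
    """Left-hand side of (15), intended for z >= 1/2."""
    A1, B1 = A + 1.0, A / 4.0 + 1.0 / A + 1.0
    return B1 * kl(y, z) + A1 * (1.0 - y) - y * (1.0 - z) / z

z_grid = [k / 200 for k in range(1, 200)]   # z strictly inside (0, 1)
y_grid = [k / 200 for k in range(0, 201)]   # y in [0, 1]

worst = float("inf")
for A in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    for z in z_grid:
        for y in y_grid:
            if z <= 0.5:
                worst = min(worst, f14(y, z, A))
            if z >= 0.5:
                worst = min(worst, f15(y, z, A))
print(worst)
```

On this grid the smallest value stays non-negative; the margin becomes small for small $A$ near $z = \frac{1}{2}$, which suggests that the constant in $B \ge \frac{1}{4}A + \frac{1}{A}$ cannot be improved much by this route.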

3.3 General Loss 

There are only very few restrictions imposed on the loss
$l_{x_ty_t}$ in Theorem 2, namely that it is static and in the unit
interval $[0,1]$. If we look at the proof of Theorem 2 we see that
the time-independence has not been used at all. The proof is still
valid for an individual loss function $l^t_{x_ty_t}\in[0,1]$ for
each step $t$. The loss might even depend on the actual history
$x_{<t}$. The case of a loss $l^t_{x_ty_t}(x_{<t})$ bounded to a
general interval $[l_{min}, l_{max}]$ can be reduced to the unit
interval case by rescaling $l$. We introduce a scaled loss $l'$

    $$0 \le l'^{\,t}_{x_ty_t}(x_{<t}) := \frac{l^t_{x_ty_t}(x_{<t}) - l_{min}}{l_\Delta} \le 1, \quad\text{where } l_\Delta := l_{max} - l_{min}.$$

The prediction scheme $\Lambda'_\rho$ based on $l'$ is identical to
the original prediction scheme $\Lambda_\rho$ based on $l$, since
argmin in (6) is not affected by a constant scaling and a shift of
its argument. From $y_t^{\Lambda'_\rho} = y_t^{\Lambda_\rho}$ it
follows that
$l'_{t\Lambda_\rho} = (l_{t\Lambda_\rho} - l_{min})/l_\Delta$ and
$L'_{n\Lambda_\rho} = (L_{n\Lambda_\rho} - n\,l_{min})/l_\Delta$
($H'_n = H_n$, since $l$ is not involved). Theorem 2 is valid for
the primed quantities, since $l'\in[0,1]$. Inserting
$L'_{n\Lambda_\mu}$ and $L'_{n\Lambda_\xi}$ and rearranging terms
we get


Theorem 3 (General loss bound) Let there be binary sequences
$x_1x_2...$ drawn with probability $\mu(x_{1:n})$ for the first $n$
symbols. A system taking action (or predicting)
$y_t\in\mathcal{Y}$ given $x_{<t}$ receives loss
$l^t_{x_ty_t}(x_{<t}) \in [l_{min}, l_{min}+l_\Delta]$ if $x_t$ is
the true $t$-th symbol of the sequence. The $\Lambda_\rho$-system
(6) acts (or predicts) as to minimize the $\rho$-expected loss.
$\Lambda_\xi$ is the universal prediction scheme based on the
universal prior $\xi$. $\Lambda_\mu$ is the optimal informed
prediction scheme. The total $\mu$-expected losses
$L_{n\Lambda_\xi}$ and $L_{n\Lambda_\mu}$ of $\Lambda_\xi$ and
$\Lambda_\mu$ as defined in (7) are bounded in the following way

    $$0 \le L_{n\Lambda_\xi} - L_{n\Lambda_\mu} \le l_\Delta H_n + \sqrt{4(L_{n\Lambda_\mu} - n\,l_{min})\,l_\Delta H_n + l_\Delta^2 H_n^2}$$

where $H_n \le \ln\frac{1}{w_\mu}$ is the relative entropy (5), and
$w_\mu$ is the weight (2) of $\mu$ in $\xi$.
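The rearrangement that leads from Theorem 2 to Theorem 3 can be spelled out step by step. The derivation below is a sketch of that algebra; primed quantities refer to the rescaled loss $l' = (l - l_{min})/l_\Delta$, and the factor $n$ in front of $l_{min}$ arises because the shift is applied in each of the $n$ steps.

```latex
\begin{align*}
L'_{n\Lambda_\rho} &= \frac{L_{n\Lambda_\rho} - n\,l_{min}}{l_\Delta},
  \qquad H'_n = H_n, \\
L'_{n\Lambda_\xi} - L'_{n\Lambda_\mu}
  &\le H_n + \sqrt{4 L'_{n\Lambda_\mu} H_n + H_n^2}
  && \text{(Theorem 2 applied to } l' \in [0,1]\text{)} \\
\frac{L_{n\Lambda_\xi} - L_{n\Lambda_\mu}}{l_\Delta}
  &\le H_n + \sqrt{\frac{4 (L_{n\Lambda_\mu} - n\,l_{min}) H_n}{l_\Delta} + H_n^2} \\
L_{n\Lambda_\xi} - L_{n\Lambda_\mu}
  &\le l_\Delta H_n
   + \sqrt{4 (L_{n\Lambda_\mu} - n\,l_{min})\, l_\Delta H_n + l_\Delta^2 H_n^2}.
\end{align*}
```

The last line is obtained by multiplying through by $l_\Delta > 0$ and moving the factor $l_\Delta^2$ inside the square root.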

4 Application to Games of Chance

4.1 Introduction/Example 

Think of investing in the stock market. At time $t$ an amount of
money $s_t$ is invested in portfolio $y_t$, where we have access to
past knowledge $x_{<t}$ (e.g. charts). After our choice of
investment we receive new information $x_t$, and the new portfolio
value is $r_t$. The best we can expect is to have a probabilistic
model $\mu$ of the behaviour of the stock market. The goal is to
maximize the net $\mu$-expected profit $p_t = r_t - s_t$. Nobody
knows $\mu$, but the assumption of all traders is that there is a
computable, profitable $\mu$ they try to find or approximate. From
Theorem 1 we know that Solomonoff's universal prior
$\xi(x_t|x_{<t})$ converges to any computable $\mu(x_t|x_{<t})$
with probability 1. If there is a computable, asymptotically
profitable trading scheme at all, the $\Lambda_\xi$ scheme should
also be profitable in the long run. To get a practically useful,
computable scheme we have to restrict $M$ to a finite set of
computable distributions, e.g. with bounded Levin complexity $Kt$
[LV97]. Although convergence of $\xi$ to $\mu$ is pleasing, what we
are really interested in is whether $\Lambda_\xi$ is asymptotically
profitable and how long it takes to become profitable. This will be
explored in the following.

4.2 Games of Chance 

We use Theorem 3 (or its generalization to arbitrary action and
alphabet, proved in [Hut01]) to estimate the time needed to reach
the winning threshold when using $\Lambda_\xi$ in a game of chance.
We assume a game (or a sequence of possibly correlated games) which
allows a sequence of bets and observations. In step $t$ we bet,
depending on the history $x_{<t}$, a certain amount of money $s_t$,
take some action $y_t$, observe outcome $x_t$, and receive reward
$r_t$. Our profit, which we want to maximize, is $p_t = r_t - s_t$.
The loss, which we want to minimize, can be defined as the negative
profit, $l_{x_ty_t} = -p_t$. The probability of outcome $x_t$,
possibly depending on the history $x_{<t}$, is $\mu(x_t|x_{<t})$.
The total $\mu$-expected profit when using scheme $\Lambda_\rho$ is
$P_{n\Lambda_\rho} = -L_{n\Lambda_\rho}$. If we knew $\mu$, the
optimal strategy to maximize our expected profit is just
$\Lambda_\mu$. We assume $P_{n\Lambda_\mu} > 0$ (otherwise there is
no winning strategy at all, since
$P_{n\Lambda_\mu} \ge P_{n\Lambda_\rho}\ \forall\rho$). Often we
are not in the favorable position of knowing $\mu$, but we know (or
assume) that $\mu\in M$ for some $M$, for instance that $\mu$ is a
computable probability distribution. From Theorem 3 we see that the
average profit per round
$\bar p_{n\Lambda_\xi} := \frac{1}{n}P_{n\Lambda_\xi}$ of the
universal scheme converges to the average profit per round
$\bar p_{n\Lambda_\mu} := \frac{1}{n}P_{n\Lambda_\mu}$ of the
optimal informed scheme, i.e. asymptotically we can make the same
money even without knowing $\mu$, by just using the universal
$\Lambda_\xi$ scheme. Theorem 3 allows us to lower bound the
universal profit $P_{n\Lambda_\xi}$

    $$P_{n\Lambda_\xi} \ge P_{n\Lambda_\mu} - p_\Delta H_n - \sqrt{4(n\,p_{max} - P_{n\Lambda_\mu})\,p_\Delta H_n + p_\Delta^2 H_n^2}$$    (16)

where $p_{max}$ is the maximal profit per round and $p_\Delta$ the
profit range. The time needed for $\Lambda_\xi$ to perform well can
also be estimated. An interesting quantity is the expected number
of rounds needed to reach the winning zone. Using
$P_{n\Lambda_\mu} > 0$ one can show that the r.h.s. of (16) is
positive if, and only if

    $$n > \frac{2p_\Delta(2p_{max} - \bar p_{n\Lambda_\mu})}{\bar p_{n\Lambda_\mu}^2}\cdot H_n.$$    (17)
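The relation between the profit bound (16) and the round-count condition can be explored numerically. The sketch below evaluates the right-hand side of (16) with $P = n\bar p$ and compares its sign with the threshold $n > 2p_\Delta(2p_{max} - \bar p)H_n/\bar p^2$, which is what one obtains by isolating the square root in (16) and squaring; treat that threshold as a reconstruction. The numbers (profit range, average profit, a class of ten equally weighted environments) and the function names are purely illustrative assumptions.

```python
import math

def rhs16(n, p_bar, p_max, p_delta, H):
    """Right-hand side of (16) with P_{n,Lambda_mu} = n * p_bar."""
    P = n * p_bar
    return P - p_delta * H - math.sqrt(
        4.0 * (n * p_max - P) * p_delta * H + (p_delta * H) ** 2)

def n_threshold(p_bar, p_max, p_delta, H):
    """Rounds beyond which rhs16 is positive (obtained by squaring (16))."""
    return 2.0 * p_delta * (2.0 * p_max - p_bar) / p_bar ** 2 * H

def n_sufficient(p_bar, p_delta, d_mu):
    """Simpler, weaker condition n > (2 p_delta / p_bar)^2 * d_mu."""
    return (2.0 * p_delta / p_bar) ** 2 * d_mu

# Hypothetical game: profits in [-1, 1], informed average profit 0.1 per round,
# ten equally weighted environments, worst case H_n = d_mu = ln 10.
p_max, p_delta, p_bar = 1.0, 2.0, 0.1
d_mu = math.log(10.0)

n17 = n_threshold(p_bar, p_max, p_delta, d_mu)
n_suff = n_sufficient(p_bar, p_delta, d_mu)
n_above = math.floor(n17) + 2
n_below = math.floor(n17) - 1
print(n17, n_suff)
print(rhs16(n_above, p_bar, p_max, p_delta, d_mu),
      rhs16(n_below, p_bar, p_max, p_delta, d_mu))
```

In this example the guaranteed profit of the universal scheme turns positive after roughly 1750 rounds, while the simpler sufficient condition asks for about twice as many; both scale with $(p_\Delta/\bar p)^2$ and with the information $d_\mu$ that has to be learned.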



Theorem 4 (Time to Win) Let there be binary sequences $x_1x_2...$
drawn with probability $\mu(x_{1:n})$ for the first $n$ symbols. In
step $t$ we make a bet, depending on the history $x_{<t}$, take
some action $y_t$, and observe outcome $x_t$. Our net profit is
$p_t\in[p_{max}-p_\Delta, p_{max}]$. The $\Lambda_\rho$-system (6)
acts as to maximize the $\rho$-expected profit. $P_{n\Lambda_\rho}$
is the total and
$\bar p_{n\Lambda_\rho} = \frac{1}{n}P_{n\Lambda_\rho}$ is the
average expected profit of the first $n$ rounds. For the universal
$\Lambda_\xi$ and for the optimal informed $\Lambda_\mu$ prediction
scheme the following holds:

    i)  $\bar p_{n\Lambda_\xi} = \bar p_{n\Lambda_\mu} - O(n^{-1/2}) \to \bar p_{n\Lambda_\mu}$ for $n\to\infty$

    ii) if $n > \big(\frac{2p_\Delta}{\bar p_{n\Lambda_\mu}}\big)^2 \cdot d_\mu$ and $\bar p_{n\Lambda_\mu} > 0$ $\Rightarrow$ $\bar p_{n\Lambda_\xi} > 0$

where $w_\mu = e^{-d_\mu}$ is the weight (2) of $\mu$ in $\xi$.

By dividing (16) by $n$ and using $H_n \le d_\mu$ (5) we see that
the leading order of $\bar p_{n\Lambda_\xi} - \bar p_{n\Lambda_\mu}$
is bounded by $\sqrt{4p_\Delta p_{max} d_\mu/n}$, which proves (i).
The condition in (ii) is actually a weakening of (17).
$P_{n\Lambda_\xi}$ is trivially positive for $p_{min} > 0$, since
in this wonderful case all profits are positive. For negative
$p_{min}$ the condition of (ii) implies (17), since
$p_\Delta > p_{max}$, and (17) implies positive (16), i.e.
$P_{n\Lambda_\xi} > 0$, which proves (ii).

If a winning strategy $\Lambda_\rho$ with
$\bar p_{n\Lambda_\rho} > \varepsilon > 0$ exists, then
$\Lambda_\xi$ is asymptotically also a winning strategy with the
same average profit.

4.3 Information-Theoretic Interpretation

We try to give an intuitive explanation of Theorem 4(ii). We know
that $\xi(x_t|x_{<t})$ converges to $\mu(x_t|x_{<t})$ for
$t\to\infty$. In a sense $\Lambda_\xi$ learns $\mu$ from past data
$x_{<t}$. The information content in $\mu$ relative to $\xi$ is
$H_\infty/\ln 2 \le d_\mu/\ln 2$. One might think of a Shannon-Fano
prefix code of $\mu_i\in M$ of length
$\lceil d_{\mu_i}/\ln 2\rceil$, which exists since the Kraft
inequality
$\sum_i 2^{-\lceil d_{\mu_i}/\ln 2\rceil} \le \sum_i w_{\mu_i} \le 1$
is satisfied. $d_\mu/\ln 2$ bits have to be learned before
$\Lambda_\xi$ can be as good as $\Lambda_\mu$. In the worst case,
the only information contained in $x_t$ is in form of the received
profit $p_t$. Remember that we always know the profit $p_t$ before
the next cycle starts.

Assume that the distribution of the profits in the interval
$[p_{min}, p_{max}]$ is mainly due to noise, and there is only a
small informative signal of amplitude $\bar p_{n\Lambda_\mu}$. To
reliably determine the sign of a signal of amplitude
$\bar p_{n\Lambda_\mu}$, disturbed by noise of amplitude
$p_\Delta$, we have to resubmit a bit
$O((p_\Delta/\bar p_{n\Lambda_\mu})^2)$ times (this reduces the
standard deviation below the signal amplitude
$\bar p_{n\Lambda_\mu}$). To learn $\mu$, $d_\mu/\ln 2$ bits have
to be transmitted, which requires
$n \ge O((p_\Delta/\bar p_{n\Lambda_\mu})^2)\cdot d_\mu/\ln 2$
cycles. This expression coincides with the condition in (ii).
Identifying the signal amplitude with $\bar p_{n\Lambda_\mu}$ is
the weakest part of this consideration, as we have no argument why
this should be true. It may be interesting to make the analogy more
rigorous, which may also lead to a simpler proof of (ii) not based
on Theorems 2 and 3.

5 Outlook

In the following we discuss several directions in which the
findings of this work may be extended.

5.1 General Alphabet

In many cases the prediction unit is not a bit, but a letter from a
finite alphabet $\mathcal{A}$. Non-binary prediction cannot be
(easily) reduced to the binary case. One might think of a binary
coding of the symbols $x_t\in\mathcal{A}$ in the sequence
$x_1x_2...$. But this makes it necessary to predict a block of bits
$x_t$ before one receives the true block of bits $x_t$, which
differs from the bit by bit prediction considered here and in
[Sol78]. Fortunately, all theorems (1 to 4) carry over to general
alphabets [Hut01]. Unfortunately, the proofs are rather complex. In
many cases the basic prediction unit is not even a letter from a
finite alphabet, but a number (for inducing number sequences), or a
word (for completing sentences), a real number or vector (for
physical measurements). The prediction may either be generalized to
a block by block prediction of symbols or, more suitably, the
finite alphabet $\mathcal{A}$ could be generalized to a countable
(numbers, words) or continuous (real or vector) alphabet. The
theorems should generalize to countably infinite alphabets by
appropriately taking the limit $|\mathcal{A}|\to\infty$ and to
continuous alphabets by a denseness or separability argument.

5.2 Partial Prediction, Delayed Prediction, Classification

The $\Lambda_\rho$ schemes may also be used for partial prediction
where, for instance, only every $m$-th symbol is predicted. This
can be arranged by setting the loss $l^t$ to zero when no
prediction is made, e.g. if $t$ is not a multiple of $m$.
Classification could be interpreted as partial sequence prediction,
where $x_{(k-1)m+1:km-1}$ is classified as $x_{km}$. There are
better ways for classification by treating $x_{(k-1)m+1:km-1}$ as
pure conditions in $\xi$, as has been done in [Hut00] in a more
general context. Another possibility is to generalize the
prediction schemes and theorems to delayed sequence prediction,
where the true symbol $x_t$ is given only in cycle $t+d$. A delayed
feedback is common in many practical problems.

5.3 More Active Systems

Prediction means guessing the future, but not influencing it. We
mentioned the possibility of interpreting $y_t\in\mathcal{Y}$ as an
action with $\mathcal{Y}\ne\mathcal{A}$. This tiny step towards a
more active system is described in more detail in [Hut01]. The
probability $\mu$ is still independent of the action, and the loss
function $l^t$ has to be known in advance. This ensures that the
greedy strategy (6) is optimal. The loss function may be
generalized to depend not only on the history $x_{<t}$, but also on
the historic actions $y_{<t}$ with $\mu$ still independent of the
action. It would be interesting to know whether the scheme
$\Lambda$ and/or the loss bounds generalize to this case. The full
model of an acting agent influencing the environment has been
developed in [Hut00], but loss bounds have yet to be proven.

5.4 The Weighted Majority Algorithm(s)

The Weighted Majority (WM) algorithm is a related universal
forecasting algorithm. It was invented by Littlestone and Warmuth
[LW89, LW94] and Vovk [Vov92] and further developed in [Ces97,
HKW98, KW99] and others. Many variations known by many names have
meanwhile been invented. Early works in this direction are [Daw84,
Ris89]. See [Vov99] for a review and further references. The
setting and basic idea of WM are the following. Consider a finite
binary sequence $x_1x_2...x_n\in\{0,1\}^n$ and a finite set
$\mathcal{E}$ of experts $e\in\mathcal{E}$ making predictions
$x_t^e$ in the unit interval $[0,1]$ based on past observations
$x_1x_2...x_{t-1}$. The loss of expert $e$ in step $t$ is defined
as $|x_t - x_t^e|$. In the case of binary predictions
$x_t^e\in\{0,1\}$, $|x_t - x_t^e|$ coincides with our error measure
defined in [Hut01]. The WM algorithm $p_{\beta n}$ combines the
predictions of all experts. It forms its own prediction
$x_t^p\in[0,1]$ according to some weighted average of the experts'
predictions $x_t^e$. There are certain update rules for the weights
depending on some parameter $\beta$. Various bounds for the total
loss $L_p(x) := \sum_{t=1}^{n} |x_t - x_t^p|$ of WM in terms of the
total loss $L_e(x) := \sum_{t=1}^{n} |x_t - x_t^e|$ of the best
expert $e\in\mathcal{E}$ have been proven. It is possible to fine
tune $\beta$ and to eliminate the necessity of knowing $n$ in
advance. The most general bound of this kind is [Ces97]

    $$L_p(x) \le L_e(x) + 2.8\ln|\mathcal{E}| + 4\sqrt{L_e(x)\ln|\mathcal{E}|}.$$    (18)

It is interesting that our bound in Theorem 2 (with
$H_n \le \ln|M|$ for uniform weights) has a quite similar structure
as this bound, although the algorithms, the settings, the proofs
and the interpretation are quite different. Whereas WM performs
well in any environment, but only relative to a given set of
experts $\mathcal{E}$, our $\Lambda_\xi$ predictor competes with
the best possible $\Lambda_\mu$ predictor (and hence with any other
$\rho$ predictor), but only for a given set of environments $M$. WM
depends on the set of experts, $\Lambda_\xi$ depends on the set of
environments $M$. The basic $p_{\beta n}$ algorithm has been
extended in different directions: incorporation of different
initial weights ($\ln|\mathcal{E}| \to \ln\frac{1}{w_e}$) [LW89,
Vov92], more general loss functions [HKW98], continuous valued
outcomes [HKW98], and multi-dimensional predictions [KW99] (but not
yet for the absolute loss). The works of Yamanishi [Yam97] and
[Yam98] lie somewhat in between WM and this work; "WM" techniques
are used to prove expected loss bounds (but only for sequences of
independent symbols/experiments and different classes of loss
functions). Finally, note that the predictions of WM are
continuous. In a sense it is more natural to predict 0 or 1 on a
binary sequence, rather than some real number. On the other hand it
is possible to convert the continuous prediction of WM into a
probabilistic binary prediction by interpreting $x_t^p\in[0,1]$ as
the probability of predicting 1, and $|x_t - x_t^p|$ as the
probability of making an error. Note that the expectation is taken
over the probabilistic prediction, whereas for the deterministic
$\Lambda_\xi$ algorithm the expectation is taken over the
environmental distribution $\mu$. The multi-dimensional case [KW99]
could then be interpreted as a (probabilistic) prediction of
symbols over an alphabet $\mathcal{A} = \{0,1\}^d$, but error
bounds for the absolute loss have yet to be proven. It would be
interesting to generalize WM and bound (18) to arbitrary alphabet
and to general loss functions with probabilistic interpretation.
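To make the WM setting concrete, here is a minimal sketch of one common weighted-average variant with absolute loss. The multiplicative update $w_e \leftarrow w_e\,\beta^{|x_t - x_t^e|}$, the three toy experts, and all names are assumptions chosen for illustration; this is not the tuned algorithm behind (18). For this particular variant a short potential argument gives the weaker bound $L_p \le (\ln|\mathcal{E}| + L_e\ln\frac{1}{\beta})/(1-\beta)$, which the example evaluates.

```python
import math

def weighted_majority(x, expert_preds, beta=0.5):
    """Weighted-average forecaster with absolute loss (illustrative variant).

    x            : list of bits in {0, 1}
    expert_preds : expert_preds[e][t] in [0, 1] is expert e's prediction of x[t]
    Returns the forecaster's total loss and the list of expert total losses.
    """
    weights = [1.0] * len(expert_preds)
    loss_p = 0.0
    loss_e = [0.0] * len(expert_preds)
    for t, x_t in enumerate(x):
        total = sum(weights)
        prediction = sum(w_e * preds[t]
                         for w_e, preds in zip(weights, expert_preds)) / total
        loss_p += abs(x_t - prediction)
        for e, preds in enumerate(expert_preds):
            step_loss = abs(x_t - preds[t])
            loss_e[e] += step_loss
            weights[e] *= beta ** step_loss   # assumed multiplicative update
    return loss_p, loss_e

# Toy data: a periodic sequence and three hypothetical experts.
x = [1, 1, 0] * 20
experts = [
    [0.0] * len(x),        # always predicts 0
    [1.0] * len(x),        # always predicts 1
    [0.5] + x[:-1],        # predicts the previous symbol
]

beta = 0.5
L_p, L_e = weighted_majority(x, experts, beta)
best = min(L_e)
classic_bound = (math.log(len(experts)) + best * math.log(1.0 / beta)) / (1.0 - beta)
print(L_p, best, classic_bound)
```

The forecaster never needs to know which expert will turn out best; its loss is controlled relative to the best expert in hindsight, which is the sense in which WM is competitive with respect to a set of experts rather than a set of environments.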



5.5 Miscellaneous 

Another direction is to investigate the learning aspect of
universal prediction. Many prediction schemes explicitly learn and
exploit a model of the environment. Learning and exploitation are
merged in the framework of universal Bayesian prediction. A
separation of these two aspects in the spirit of hypothesis
learning with MDL [VL00] could lead to new insights. The attempt at
an information theoretic interpretation of Theorem 4 may be made
more rigorous in this or another way. In the end, this may lead to
a simpler proof of Theorem 4 and maybe even of the loss bounds.
Finally, the system should be implemented and tested on specific
induction problems for specific finite $M$ with computable $\xi$.



6 Summary 

Solomonoff's universal probability measure has been generalized to
arbitrary probability classes and weights. A wise choice of $M$
widens the applicability by reducing the computational burden for
$\xi$. A framework where predictions result in losses of arbitrary,
but known, form has been considered. Loss bounds for general loss
functions have been proved, which show that the universal
prediction scheme $\Lambda_\xi$ can compete with the best possible
informed scheme $\Lambda_\mu$. The results show that universal
prediction is ideally suited for games of chance with a sequence of
bets, observations, and rewards. Extensions in various directions
have been suggested.